Chinese medical named entity recognition based on self-attention mechanism and lexicon enhancement

Xin-Ran LUO1, Tian-Rui LI2, Zhen JIA3

  1. School of Computing and Artificial Intelligence, Southwest Jiaotong University
    2. School of Computing and Artificial Intelligence, Southwest Jiaotong University
    3. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China
  • Received: 2023-02-27  Revised: 2023-04-12  Online: 2023-08-14  Published: 2023-08-14
  • Contact: Xin-Ran LUO
  • Supported by:
    Research on Semi-supervised Federated Learning for Cross-domain Alzheimer's Disease Medical Big Data; Research on Psychophysiological Semantic Representation and Intelligent Identification of Affective Disorders by Fusing Clinical Big Data; Research on Deep Collaborative Fusion and Cross-domain Federated Learning for Urban Spatio-temporal Big Data

Abstract: To address the difficulty of word boundary recognition caused by nested entities in Chinese medical texts, as well as the severe loss of semantic information when existing lattice structures integrate lexical features, an adaptive lexical information enhancement model for Chinese medical named entity recognition is proposed. First, a Bidirectional Long Short-Term Memory (BiLSTM) network encodes the contextual information of the character sequence and captures long-distance dependencies. Then, the potential word information of each character is modeled as char-word pairs, and a self-attention mechanism realizes the internal interactions among different words. Finally, a lexicon adapter based on bilinear attention integrates the lexical information into every character of the text sequence, which effectively enhances semantic information, makes full use of the rich boundary information of words, and suppresses words with low relevance. Experimental results show that the average F1 score of the proposed model increases by 1.94% to 2.38% compared with character-based baseline models, and that the best results are achieved when the model is combined with BERT.
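
A minimal sketch of the bilinear-attention lexicon adapter described above may help make the fusion step concrete. It assumes PyTorch; the class name, tensor names, and the padding of candidate words to K per character are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BilinearLexiconAdapter(nn.Module):
    """Sketch: fuse matched-word information into each character
    representation via bilinear attention (illustrative, not the paper's code)."""

    def __init__(self, char_dim: int, word_dim: int):
        super().__init__()
        # Bilinear form h_c^T U w_k scores each candidate word against the character
        self.bilinear = nn.Bilinear(char_dim, word_dim, 1)
        # Project word vectors into the character space before fusion
        self.word_proj = nn.Linear(word_dim, char_dim)

    def forward(self, char_h, word_e, word_mask):
        # char_h:    (B, L, char_dim)     character states, e.g. BiLSTM outputs
        # word_e:    (B, L, K, word_dim)  embeddings of up to K matched words per character
        # word_mask: (B, L, K)            1 for real matched words, 0 for padding
        B, L, K, _ = word_e.shape
        char_exp = char_h.unsqueeze(2).expand(-1, -1, K, -1).contiguous()
        scores = self.bilinear(char_exp, word_e).squeeze(-1)          # (B, L, K)
        scores = scores.masked_fill(word_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        attn = torch.nan_to_num(attn)      # characters with no matched word get zero weights
        fused = torch.einsum("blk,blkd->bld", attn, self.word_proj(word_e))
        return char_h + fused              # lexicon-enhanced character representation
```

The softmax over the K candidate words is what lets low-relevance matches receive near-zero weight, i.e. the suppression of weakly correlated words mentioned in the abstract.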

Key words: Named entity recognition, Chinese medical text, Lexicon adapter, Self-attention mechanism, Bidirectional Long Short-Term Memory (BiLSTM)


CLC number: